Recommendations with IBM

In this notebook, you will be putting your recommendation skills to use on real data from the IBM Watson Studio platform.

You may either submit your notebook through the workspace here, or you may work from your local machine and submit through the next page. Either way assure that your code passes the project RUBRIC. Please save regularly.

By following the table of contents, you will build out a number of different methods for making recommendations that can be used for different situations.

Table of Contents

I. Exploratory Data Analysis
II. Rank Based Recommendations
III. User-User Based Collaborative Filtering
IV. Content Based Recommendations (EXTRA - NOT REQUIRED)
V. Matrix Factorization
VI. Extras & Concluding

At the end of the notebook, you will find directions for how to submit your work. Let's get started by importing the necessary libraries and reading in the data.

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import project_tests as t
import pickle
import plotly.graph_objects as go

%matplotlib inline

df = pd.read_csv('../data/user-item-interactions.csv')
df_content = pd.read_csv('../data/articles_community.csv')
del df['Unnamed: 0']
del df_content['Unnamed: 0']

# Show df to get an idea of the data
df.head()
Out[1]:
article_id title email
0 1430.0 using pixiedust for fast, flexible, and easier... ef5f11f77ba020cd36e1105a00ab868bbdbf7fe7
1 1314.0 healthcare python streaming application demo 083cbdfa93c8444beaa4c5f5e0f5f9198e4f9e0b
2 1429.0 use deep learning for image classification b96a4f2e92d8572034b1e9b28f9ac673765cd074
3 1338.0 ml optimization using cognitive assistant 06485706b34a5c9bf2a0ecdac41daf7e7654ceb7
4 1276.0 deploy your python model as a restful api f01220c46fc92c6e6b161b1849de11faacd7ccb2
In [2]:
# Show df_content to get an idea of the data
df_content.head()
Out[2]:
doc_body doc_description doc_full_name doc_status article_id
0 Skip navigation Sign in SearchLoading...\r\n\r... Detect bad readings in real time using Python ... Detect Malfunctioning IoT Sensors with Streami... Live 0
1 No Free Hunch Navigation * kaggle.com\r\n\r\n ... See the forest, see the trees. Here lies the c... Communicating data science: A guide to present... Live 1
2 ☰ * Login\r\n * Sign Up\r\n\r\n * Learning Pat... Here’s this week’s news in Data Science and Bi... This Week in Data Science (April 18, 2017) Live 2
3 DATALAYER: HIGH THROUGHPUT, LOW LATENCY AT SCA... Learn how distributed DBs solve the problem of... DataLayer Conference: Boost the performance of... Live 3
4 Skip navigation Sign in SearchLoading...\r\n\r... This video demonstrates the power of IBM DataS... Analyze NY Restaurant data using Spark in DSX Live 4

Part I : Exploratory Data Analysis

Use the dictionary and cells below to provide some insight into the descriptive statistics of the data.

1. What is the distribution of how many articles a user interacts with in the dataset? Provide a visual and descriptive statistics to assist with giving a look at the number of times each user interacts with an article.

In [3]:
df.dtypes
Out[3]:
article_id    float64
title          object
email          object
dtype: object
In [4]:
df_content.dtypes
Out[4]:
doc_body           object
doc_description    object
doc_full_name      object
doc_status         object
article_id          int64
dtype: object
In [5]:
df.isnull().mean()
Out[5]:
article_id    0.00000
title         0.00000
email         0.00037
dtype: float64
In [6]:
df_content.isnull().mean()
Out[6]:
doc_body           0.013258
doc_description    0.002841
doc_full_name      0.000000
doc_status         0.000000
article_id         0.000000
dtype: float64
In [7]:
print (len(df['email'].unique()), len(df['article_id'].unique()), df.shape[0])
print (len(df_content[df_content['doc_status'] == 'Live']['article_id'].unique()))
5149 714 45993
1051
In [8]:
dist_user_article_df = (df
                        .groupby(['email'])
                        .count()
                        .sort_values(['article_id'], ascending=False)['article_id']
                        .reset_index()
                       )
dist_user_article_df.columns = ['user', 'art_inter_num']

x = dist_user_article_df['user']
y = dist_user_article_df['art_inter_num']

# Use the hovertext kw argument for hover text
fig = go.Figure(data=[go.Bar(x=x, y=y,
            hovertext=['27% market share', '24% market share', '19% market share'])])
# Customize aspect
fig.update_traces(marker_color='rgb(158,202,225)', marker_line_color='rgb(8,48,107)',
                  marker_line_width=1.5, opacity=0.6)

fig.update_layout(title_text='Distribution of User-Article Interaction')
fig.update_xaxes(title='User', showticklabels=False)
fig.update_yaxes(title='Article Interaction Number')

fig.show()
In [9]:
# Fill in the median and maximum number of user_article interactios below

median_val = dist_user_article_df['art_inter_num'].median()       # 50% of individuals interact with ____ number of articles or fewer.
max_views_by_user = dist_user_article_df['art_inter_num'].max()   # The maximum number of user-article interactions by any 1 user is ______.

2. Explore and remove duplicate articles from the df_content dataframe.

In [10]:
# Find and explore duplicate articles
In [11]:
df_content.head()
Out[11]:
doc_body doc_description doc_full_name doc_status article_id
0 Skip navigation Sign in SearchLoading...\r\n\r... Detect bad readings in real time using Python ... Detect Malfunctioning IoT Sensors with Streami... Live 0
1 No Free Hunch Navigation * kaggle.com\r\n\r\n ... See the forest, see the trees. Here lies the c... Communicating data science: A guide to present... Live 1
2 ☰ * Login\r\n * Sign Up\r\n\r\n * Learning Pat... Here’s this week’s news in Data Science and Bi... This Week in Data Science (April 18, 2017) Live 2
3 DATALAYER: HIGH THROUGHPUT, LOW LATENCY AT SCA... Learn how distributed DBs solve the problem of... DataLayer Conference: Boost the performance of... Live 3
4 Skip navigation Sign in SearchLoading...\r\n\r... This video demonstrates the power of IBM DataS... Analyze NY Restaurant data using Spark in DSX Live 4
In [12]:
print (df_content['doc_status'].unique()) # check the category of doc_status

print (len(df_content['article_id'].unique()), df_content.shape[0]) # check the dunplicated number of articles in df
['Live']
1051 1056
In [13]:
(df_content[df_content.duplicated(subset=['article_id'], keep=False) == True]
 .sort_values(['article_id'],ascending=True))
Out[13]:
doc_body doc_description doc_full_name doc_status article_id
50 Follow Sign in / Sign up Home About Insight Da... Community Detection at Scale Graph-based machine learning Live 50
365 Follow Sign in / Sign up Home About Insight Da... During the seven-week Insight Data Engineering... Graph-based machine learning Live 50
221 * United States\r\n\r\nIBM® * Site map\r\n\r\n... When used to make sense of huge amounts of con... How smart catalogs can turn the big data flood... Live 221
692 Homepage Follow Sign in / Sign up Homepage * H... One of the earliest documented catalogs was co... How smart catalogs can turn the big data flood... Live 221
232 Homepage Follow Sign in Get started Homepage *... If you are like most data scientists, you are ... Self-service data preparation with IBM Data Re... Live 232
971 Homepage Follow Sign in Get started * Home\r\n... If you are like most data scientists, you are ... Self-service data preparation with IBM Data Re... Live 232
399 Homepage Follow Sign in Get started * Home\r\n... Today’s world of data science leverages data f... Using Apache Spark as a parallel processing fr... Live 398
761 Homepage Follow Sign in Get started Homepage *... Today’s world of data science leverages data f... Using Apache Spark as a parallel processing fr... Live 398
578 This video shows you how to construct queries ... This video shows you how to construct queries ... Use the Primary Index Live 577
970 This video shows you how to construct queries ... This video shows you how to construct queries ... Use the Primary Index Live 577
In [14]:
# Remove any rows that have the same article_id - only keep the first
df_content = df_content.drop_duplicates(subset=['article_id'], keep='first')
In [15]:
print (len(df_content['article_id'].unique()), df_content.shape[0]) # check the dunplicated number of articles in df
1051 1051

3. Use the cells below to find:

a. The number of unique articles that have an interaction with a user.
b. The number of unique articles in the dataset (whether they have any interactions or not).
c. The number of unique users in the dataset. (excluding null values)
d. The number of user-article interactions in the dataset.

In [16]:
# a. 
len(df['article_id'].unique())
Out[16]:
714
In [17]:
# b.
print (len(set(df['article_id'].values)))

# same as the answer in project_test but I dont think this is correct
# unless only consider articles are in "Live" status and
# some of articles in "user-item-interactions.csv" dataset could be not in "Live" any more
print (len(set(df_content['article_id'].values))) 

print (len(set.union(set(df['article_id'].values), set(df_content['article_id'].values))))
714
1051
1328
In [18]:
# c.
len(df['email'].unique()) # why it is 5148 in answer? because there are some observations have eail == NaN
Out[18]:
5149
In [19]:
# d.
df.shape[0]
Out[19]:
45993
In [20]:
unique_articles = len(df['article_id'].unique())                             # The number of unique articles that have at least one interaction
total_articles = len(set(df_content['article_id'].values))                   # The number of unique articles on the IBM platform
unique_users = len(df[df['email'].isnull() == False]['email'].unique())      # The number of unique users
user_article_interactions = df.shape[0]                                      # The number of user-article interactions

4. Use the cells below to find the most viewed article_id, as well as how often it was viewed. After talking to the company leaders, the email_mapper function was deemed a reasonable way to map users to ids. There were a small number of null values, and it was found that all of these null values likely belonged to a single user (which is how they are stored using the function below).

In [21]:
most_viewed_article = (df
                       .groupby(['article_id'])
                       .count()['title']
                       .reset_index()
                       .sort_values(['title'], ascending=False)
                       .head(1)
                      )

most_viewed_article.columns = ['article_id', 'num_viewed']
In [22]:
most_viewed_article_id = str(most_viewed_article['article_id'].values[0])  # The most viewed article in the dataset as a string with one value following the decimal 
max_views =  most_viewed_article['num_viewed'].values[0]            # The most viewed article in the dataset was viewed how many times?
In [23]:
## No need to change the code here - this will be helpful for later parts of the notebook
# Run this cell to map the user email to a user_id column and remove the email column

def email_mapper():
    coded_dict = dict()
    cter = 1
    email_encoded = []
    
    for val in df['email']:
        if val not in coded_dict:
            coded_dict[val] = cter
            cter+=1
        
        email_encoded.append(coded_dict[val])
    return email_encoded

email_encoded = email_mapper()
del df['email']
df['user_id'] = email_encoded

# show header
df.head()
Out[23]:
article_id title user_id
0 1430.0 using pixiedust for fast, flexible, and easier... 1
1 1314.0 healthcare python streaming application demo 2
2 1429.0 use deep learning for image classification 3
3 1338.0 ml optimization using cognitive assistant 4
4 1276.0 deploy your python model as a restful api 5
In [24]:
## If you stored all your results in the variable names above, 
## you shouldn't need to change anything in this cell

sol_1_dict = {
    '`50% of individuals have _____ or fewer interactions.`': median_val,
    '`The total number of user-article interactions in the dataset is ______.`': user_article_interactions,
    '`The maximum number of user-article interactions by any 1 user is ______.`': max_views_by_user,
    '`The most viewed article in the dataset was viewed _____ times.`': max_views,
    '`The article_id of the most viewed article is ______.`': most_viewed_article_id,
    '`The number of unique articles that have at least 1 rating ______.`': unique_articles,
    '`The number of unique users in the dataset is ______`': unique_users,
    '`The number of unique articles on the IBM platform`': total_articles
}

# Test your dictionary against the solution
t.sol_1_test(sol_1_dict)
It looks like you have everything right here! Nice job!

Part II: Rank-Based Recommendations

Unlike in the earlier lessons, we don't actually have ratings for whether a user liked an article or not. We only know that a user has interacted with an article. In these cases, the popularity of an article can really only be based on how often an article was interacted with.

1. Fill in the function below to return the n top articles ordered with most interactions as the top. Test your function using the tests below.

In [25]:
def get_top_articles(n, df=df):
    '''
    INPUT:
    n - (int) the number of top articles to return
    df - (pandas dataframe) df as defined at the top of the notebook 
    
    OUTPUT:
    top_articles - (list) A list of the top 'n' article titles 
    
    '''
    # Your code here
    top_articles = (df
                    .groupby(['article_id', 'title'])
                    .count()['user_id']
                    .reset_index()
                    .sort_values(['user_id'], ascending=False)
                    .head(n)['title'])
    
    return top_articles # Return the top article titles from df (not df_content)

def get_top_article_ids(n, df=df):
    '''
    INPUT:
    n - (int) the number of top articles to return
    df - (pandas dataframe) df as defined at the top of the notebook 
    
    OUTPUT:
    top_articles - (list) A list of the top 'n' article titles 
    
    '''
    # Your code here
    top_articles = (df
                    .groupby(['article_id', 'title'])
                    .count()['user_id']
                    .reset_index()
                    .sort_values(['user_id'], ascending=False)
                    .head(n)['article_id'])
 
    return top_articles # Return the top article ids
In [26]:
print(get_top_articles(10))
print(get_top_article_ids(10))
699           use deep learning for image classification
625          insights from new york car accident reports
701                       visualize car data with brunel
697    use xgboost, scikit-learn & ibm watson machine...
652    predicting churn with the spss random tree alg...
614         healthcare python streaming application demo
600    finding optimal locations of new store using d...
526             apache spark lab, part 1: basic concepts
518              analyze energy consumption in buildings
608    gosales transactions for logistic regression m...
Name: title, dtype: object
699    1429.0
625    1330.0
701    1431.0
697    1427.0
652    1364.0
614    1314.0
600    1293.0
526    1170.0
518    1162.0
608    1304.0
Name: article_id, dtype: float64
In [27]:
# Test your function by returning the top 5, 10, and 20 articles
top_5 = get_top_articles(5)
top_10 = get_top_articles(10)
top_20 = get_top_articles(20)

# Test each of your three lists from above
t.sol_2_test(get_top_articles)
Your top_5 looks like the solution list! Nice job.
Your top_10 looks like the solution list! Nice job.
Your top_20 looks like the solution list! Nice job.

Part III: User-User Based Collaborative Filtering

1. Use the function below to reformat the df dataframe to be shaped with users as the rows and articles as the columns.

  • Each user should only appear in each row once.
  • Each article should only show up in one column.
  • If a user has interacted with an article, then place a 1 where the user-row meets for that article-column. It does not matter how many times a user has interacted with the article, all entries where a user has interacted with an article should be a 1.
  • If a user has not interacted with an item, then place a zero where the user-row meets for that article-column.

Use the tests to make sure the basic structure of your matrix matches what is expected by the solution.

In [28]:
df.head()
Out[28]:
article_id title user_id
0 1430.0 using pixiedust for fast, flexible, and easier... 1
1 1314.0 healthcare python streaming application demo 2
2 1429.0 use deep learning for image classification 3
3 1338.0 ml optimization using cognitive assistant 4
4 1276.0 deploy your python model as a restful api 5
In [29]:
df[df['user_id'] == 1].sort_values(['article_id'])['article_id'].unique()
Out[29]:
array([  43.,  109.,  151.,  268.,  310.,  329.,  346.,  390.,  494.,
        525.,  585.,  626.,  668.,  732.,  768.,  910.,  968.,  981.,
       1052., 1170., 1183., 1185., 1232., 1293., 1305., 1363., 1368.,
       1391., 1400., 1406., 1427., 1429., 1430., 1431., 1436., 1439.])
In [30]:
# create the user-article matrix with 1's and 0's

def create_user_item_matrix(df):
    '''
    INPUT:
    df - pandas dataframe with article_id, title, user_id columns
    
    OUTPUT:
    user_item - user item matrix 
    
    Description:
    Return a matrix with user ids as rows and article ids on the columns with 1 values where a user interacted with 
    an article and a 0 otherwise
    '''
    # Fill in the function here
    df['time'] = 1

    user_item = df.pivot_table(index='user_id',
                               columns='article_id',
                               values='time',
                               aggfunc='mean',
                               fill_value=0)
    
    return user_item # return the user_item matrix 

user_item = create_user_item_matrix(df)
In [31]:
## Tests: You should just need to run this cell.  Don't change the code.
assert user_item.shape[0] == 5149, "Oops!  The number of users in the user-article matrix doesn't look right."
assert user_item.shape[1] == 714, "Oops!  The number of articles in the user-article matrix doesn't look right."
assert user_item.sum(axis=1)[1] == 36, "Oops!  The number of articles seen by user 1 doesn't look right."
print("You have passed our quick tests!  Please proceed!")
You have passed our quick tests!  Please proceed!

2. Complete the function below which should take a user_id and provide an ordered list of the most similar users to that user (from most similar to least similar). The returned result should not contain the provided user_id, as we know that each user is similar to him/herself. Because the results for each user here are binary, it (perhaps) makes sense to compute similarity as the dot product of two users.

Use the tests to test your function.

In [32]:
def find_similar_users(user_id, user_item=user_item):
    '''
    INPUT:
    user_id - (int) a user_id
    user_item - (pandas dataframe) matrix of users by articles: 
                1's when a user has interacted with an article, 0 otherwise
    
    OUTPUT:
    similar_users - (list) an ordered list where the closest users (largest dot product users)
                    are listed first
    
    Description:
    Computes the similarity of every pair of users based on the dot product
    Returns an ordered
    
    '''
    # compute similarity of each user to the provided user
    user_vector = user_item[user_item.index == user_id]
    sim_vector = user_vector.dot(user_item.T).T
    sim_vector.columns = ['product_dot_value']
    
    # TDOD: one better solution is first reset index, then order the data frame and finally remove the own one
    
    # sort by similarity
    most_similar_users = sim_vector.sort_values(['product_dot_value'], ascending=False)
    
    # create list of just the ids
    most_similar_users = most_similar_users.reset_index()

    # remove the own user's id
    most_similar_users = most_similar_users[most_similar_users['user_id'] != user_id]
    
    most_similar_users = most_similar_users \
        .sort_values(['product_dot_value', 'user_id'], ascending=[False, True]) \
        .reset_index()
    most_similar_users = most_similar_users.rename(columns={"index": "original_index"})
    
#     return most_similar_users # return a list of the users in order from most to least similar
    
    return list(most_similar_users['user_id'])
        
In [33]:
# Do a spot check of your function
print("The 10 most similar users to user 1 are: {}".format(find_similar_users(1)[:10]))
print("The 5 most similar users to user 3933 are: {}".format(find_similar_users(3933)[:5]))
print("The 3 most similar users to user 46 are: {}".format(find_similar_users(46)[:3]))
The 10 most similar users to user 1 are: [3933, 23, 3782, 203, 4459, 131, 3870, 46, 4201, 49]
The 5 most similar users to user 3933 are: [1, 23, 3782, 203, 4459]
The 3 most similar users to user 46 are: [4201, 23, 3782]
In [34]:
def find_similar_users2(user_id, user_item=user_item):
    '''
    INPUT:
    user_id - (int) a user_id
    user_item - (pandas dataframe) matrix of users by articles: 
                1's when a user has interacted with an article, 0 otherwise
    
    OUTPUT:
    most_similar_users - (pandas dataframe) an ordered df where the closest users (largest dot product users)
                    are listed first
    
    Description:
    Computes the similarity of every pair of users based on the dot product
    Returns an ordered df with similarity scores
    
    '''
    # compute similarity of each user to the provided user
    user_vector = user_item[user_item.index == user_id]
    sim_vector = user_vector.dot(user_item.T).T
    sim_vector.columns = ['product_dot_value']
    sim_vector.reset_index(level=0, inplace=True)
    
    # remove the own user's id
    most_similar_users = sim_vector[sim_vector['user_id'] != user_id]
    
    most_similar_users = most_similar_users \
        .sort_values(['product_dot_value', 'user_id'], ascending=[False, True]) \
        .reset_index(drop=True)

    return most_similar_users # return a list of the users in order from most to least similar
        
In [35]:
find_similar_users2(1)[:10]
Out[35]:
user_id product_dot_value
0 3933 35
1 23 17
2 3782 17
3 203 15
4 4459 15
5 131 14
6 3870 14
7 46 13
8 4201 13
9 49 12

3. Now that you have a function that provides the most similar users to each user, you will want to use these users to find articles you can recommend. Complete the functions below to return the articles you would recommend to each user.

In [36]:
def get_article_names(article_ids, df=df):
    '''
    INPUT:
    article_ids - (list) a list of article ids
    df - (pandas dataframe) df as defined at the top of the notebook
    
    OUTPUT:
    article_names - (list) a list of article names associated with the list of article ids 
                    (this is identified by the title column)
    '''
    # Your code here
    df['article_id'] = df['article_id'].astype(str)
    
    article_names = set(df[df['article_id'].isin(article_ids)]['title'])
    
    return article_names # Return the article names associated with list of article ids


def get_user_articles(user_id, user_item=user_item):
    '''
    INPUT:
    user_id - (int) a user id
    user_item - (pandas dataframe) matrix of users by articles: 
                1's when a user has interacted with an article, 0 otherwise
    
    OUTPUT:
    article_ids - (list) a list of the article ids seen by the user
    article_names - (list) a list of article names associated with the list of article ids 
                    (this is identified by the doc_full_name column in df_content)
    
    Description:
    Provides a list of the article_ids and article titles that have been seen by a user
    '''
    # Your code here
    user_item_df = user_item[user_item.index == user_id].T
    user_item_df.columns = ['is_interacted']
    
    article_ids = user_item_df[user_item_df['is_interacted'] == 1].index.values
    article_ids = article_ids.astype(np.str_)
    
    article_names = get_article_names(article_ids)
    
    return article_ids, article_names # return the ids and names


def make_random_numbers(x):
    
    # give a random number rows in one group
    
    total = x['user_id'].count()
    r = np.random.choice(range(99999), total, replace = False)
    
    x['order_in_group'] = r
        
    return x

def user_user_recs(user_id, m=10):
    '''
    INPUT:
    user_id - (int) a user id
    m - (int) the number of recommendations you want for the user
    
    OUTPUT:
    recs - (list) a list of recommendations for the user
    
    Description:
    Loops through the users based on closeness to the input user_id
    For each user - finds articles the user hasn't seen before and provides them as recs
    Does this until m recommendations are found
    
    Notes:
    Users who are the same closeness are chosen arbitrarily as the 'next' user
    
    For the user where the number of recommended articles starts below m 
    and ends exceeding m, the last items are chosen arbitrarily
    
    '''
    # Your code here
    
    sim_users = (find_similar_users2(user_id)
                     .groupby('product_dot_value')
                     .apply(lambda x: make_random_numbers(x))
                     .sort_values(['product_dot_value', 'order_in_group'], ascending=False)
                )
    
    # filter articles that the target user has seen before - step 1
    article_seen_ids, article_seen_names = get_user_articles(user_id)
    article_seen_ids = list(article_seen_ids)
    
    recs = []
    
    for idx, row in sim_users.iterrows():
        
        article_ids = []
        
        user_id = row['user_id']
        article_ids_tmp, article_names = get_user_articles(user_id)
        
        # filter articles that the target user has seen before - step 2
        article_ids_tmp = list(article_ids_tmp)
        [article_ids.append(x) for x in article_ids_tmp if x not in article_seen_ids]
        
        recs_size = len(recs)
        ids_size = len(article_ids)

        if (recs_size + ids_size > m):
            '''
            For the user where the number of recommended articles starts below m 
            and ends exceeding m, the last items are chosen arbitrarily
            '''
            
            temp_0 = article_ids[:(m-recs_size-1)] # 0-9
            temp_1 = article_ids[(m-recs_size-1):] # 10
            
            article_ids = temp_0 + [temp_1[np.random.randint(0, len(temp_1)-1)]]
            
        for a_id in article_ids:
            if (a_id not in recs):
                recs.append(a_id)
            
            if (len(recs) == m):
                return recs
        
    return recs # return your recommendations for this user_id    
In [37]:
# Check Results
get_article_names(user_user_recs(1, 10)) # Return 10 recommendations for user 1
Out[37]:
{'accelerate your workflow with dsx',
 'deep forest: towards an alternative to deep neural networks',
 'experience iot with coursera',
 'got zip code data? prep it for analytics. – ibm watson data lab – medium',
 'graph-based machine learning',
 'higher-order logistic regression for large datasets',
 'this week in data science (april 18, 2017)',
 'timeseries data analysis of iot events by using jupyter notebook',
 'using brunel in ipython/jupyter notebooks',
 'using machine learning to predict parking difficulty'}
In [38]:
# Test your functions here - No need to change this code - just run this cell
assert set(get_article_names(['1024.0', '1176.0', '1305.0', '1314.0', '1422.0', '1427.0'])) == set(['using deep learning to reconstruct high-resolution audio', 'build a python app on the streaming analytics service', 'gosales transactions for naive bayes model', 'healthcare python streaming application demo', 'use r dataframes & ibm watson natural language understanding', 'use xgboost, scikit-learn & ibm watson machine learning apis']), "Oops! Your the get_article_names function doesn't work quite how we expect."
assert set(get_article_names(['1320.0', '232.0', '844.0'])) == set(['housing (2015): united states demographic measures','self-service data preparation with ibm data refinery','use the cloudant-spark connector in python notebook']), "Oops! Your the get_article_names function doesn't work quite how we expect."
assert set(get_user_articles(20)[0]) == set(['1320.0', '232.0', '844.0'])
assert set(get_user_articles(20)[1]) == set(['housing (2015): united states demographic measures', 'self-service data preparation with ibm data refinery','use the cloudant-spark connector in python notebook'])
assert set(get_user_articles(2)[0]) == set(['1024.0', '1176.0', '1305.0', '1314.0', '1422.0', '1427.0'])
assert set(get_user_articles(2)[1]) == set(['using deep learning to reconstruct high-resolution audio', 'build a python app on the streaming analytics service', 'gosales transactions for naive bayes model', 'healthcare python streaming application demo', 'use r dataframes & ibm watson natural language understanding', 'use xgboost, scikit-learn & ibm watson machine learning apis'])
print("If this is all you see, you passed all of our tests!  Nice job!")
If this is all you see, you passed all of our tests!  Nice job!

4. Now we are going to improve the consistency of the user_user_recs function from above.

  • Instead of arbitrarily choosing when we obtain users who are all the same closeness to a given user - choose the users that have the most total article interactions before choosing those with fewer article interactions.
  • Instead of arbitrarily choosing articles from the user where the number of recommended articles starts below m and ends exceeding m, choose articles with the articles with the most total interactions before choosing those with fewer total interactions. This ranking should be what would be obtained from the top_articles function you wrote earlier.
In [39]:
def get_top_sorted_users(user_id, df=df, user_item=user_item):
    '''
    INPUT:
    user_id - (int)
    df - (pandas dataframe) df as defined at the top of the notebook 
    user_item - (pandas dataframe) matrix of users by articles: 
            1's when a user has interacted with an article, 0 otherwise
    
            
    OUTPUT:
    neighbors_df - (pandas dataframe) a dataframe with:
                    neighbor_id - is a neighbor user_id
                    similarity - measure of the similarity of each user to the provided user_id
                    num_interactions - the number of articles viewed by the user - if a u
                    
    Other Details - sort the neighbors_df by the similarity and then by number of interactions where 
                    highest of each is higher in the dataframe
     
    '''
    # Your code here
    user_item_interact_df = (df[df['user_id'] != user_id].groupby(['user_id'])['article_id'].count().reset_index())
    user_similarity_df = find_similar_users2(user_id)
    
    neighbors_df = pd.merge(
                        user_item_interact_df,
                        user_similarity_df,
                        on=['user_id'],
                        how='inner'
                    )
    
    neighbors_df.columns = ['user_id', 'interact_num', 'similarity_value']
    
    neighbors_df = neighbors_df.sort_values(['similarity_value', 'interact_num'], ascending=False)

    return neighbors_df # Return the dataframe specified in the doc_string

################################################################################################
def get_user_top_articles(user_id, df=df):
    
    top_articles_overall = (df
                       .groupby(['article_id', 'title'])
                       .count()['user_id']
                       .reset_index()
                       .sort_values(['user_id'], ascending=False)
                   )
    top_articles_overall.columns = ['article_id', 'title', 'interact_num']
    
    user_articles = df[df['user_id'] == user_id]['article_id']
    user_articles = user_articles.drop_duplicates(keep='first')
    
    top_articles = pd.merge(top_articles_overall, user_articles, on='article_id', how='inner') \
        .sort_values(['interact_num'], ascending=False)
    return top_articles

def user_user_recs_part2(user_id, m=10):
    '''
    INPUT:
    user_id - (int) a user id
    m - (int) the number of recommendations you want for the user
    
    OUTPUT:
    recs - (list) a list of recommendations for the user by article id
    rec_names - (list) a list of recommendations for the user by article title
    
    Description:
    Loops through the users based on closeness to the input user_id
    For each user - finds articles the user hasn't seen before and provides them as recs
    Does this until m recommendations are found
    
    Notes:
    * Choose the users that have the most total article interactions 
    before choosing those with fewer article interactions.

    * Choose articles with the articles with the most total interactions 
    before choosing those with fewer total interactions. 
   
    '''
    # Your code here
    sim_users = get_top_sorted_users(user_id)
    
    # TODO; filter articles that the target user has seen before    
    recs = []
    rec_names = []
    
    for idx, row in sim_users.iterrows():
        
        user_id = row['user_id']
        articles = get_user_top_articles(user_id)
        
        article_ids = list(articles['article_id'])
        
        recs_size = len(recs)
        ids_size = len(article_ids)

        if (recs_size + ids_size > m):
            
            article_ids = article_ids[:(m-recs_size)]
            
        for a_id in article_ids:
            if (a_id not in recs):
                recs.append(a_id)
                
                rec_names += articles[articles['article_id'] == a_id]['title'].values.tolist()
            
            if (len(recs) == m):
                return recs, rec_names
    
    return recs, rec_names
In [40]:
# Quick spot check - don't change this code - just use it to test your functions
rec_ids, rec_names = user_user_recs_part2(20, 10)
print("The top 10 recommendations for user 20 are the following article ids:")
print(rec_ids)
print()
print("The top 10 recommendations for user 20 are the following article names:")
print(rec_names)
The top 10 recommendations for user 20 are the following article ids:
['1330.0', '1427.0', '1364.0', '1170.0', '1162.0', '1304.0', '1351.0', '1160.0', '1354.0', '1368.0']

The top 10 recommendations for user 20 are the following article names:
['insights from new york car accident reports', 'use xgboost, scikit-learn & ibm watson machine learning apis', 'predicting churn with the spss random tree algorithm', 'apache spark lab, part 1: basic concepts', 'analyze energy consumption in buildings', 'gosales transactions for logistic regression model', 'model bike sharing data with spss', 'analyze accident reports on amazon emr spark', 'movie recommender system with spark machine learning', 'putting a human face on machine learning']

5. Use your functions from above to correctly fill in the solutions to the dictionary below. Then test your dictionary against the solution. Provide the code you need to answer each following the comments below.

In [41]:
### Tests with a dictionary of results

user1_most_sim = find_similar_users(1)[0] # Find the user that is most similar to user 1 
user131_10th_sim = find_similar_users(131)[9] # Find the 10th most similar user to user 131
In [42]:
## Dictionary Test Here
sol_5_dict = {
    'The user that is most similar to user 1.': user1_most_sim, 
    'The user that is the 10th most similar to user 131': user131_10th_sim,
}

t.sol_5_test(sol_5_dict)
This all looks good!  Nice job!

6. If we were given a new user, which of the above functions would you be able to use to make recommendations? Explain. Can you think of a better way we might make recommendations? Use the cell below to explain a better method for new users.

Provide your response here.

Since there is no any article interaction record that a new user has, it is somehow impossible to predict his taste of preference without any informative data about him. Therefore, articles with a higher interaction rankings (i.e., get_top_articles function) are much highly recommended for that new user.

One better solution is that:

  1. asking categories of articles that the new user prefers.
  2. based on the preference, creating a content based or category based recommendation engine.
  3. providing top n simliar articles to the new user.

7. Using your existing functions, provide the top 10 recommended articles you would provide for the a new user below. You can test your function against our thoughts to make sure we are all on the same page with how we might make a recommendation.

In [43]:
new_user = '0.0'

# What would your recommendations be for this new user '0.0'?  As a new user, they have no observed articles.
# Provide a list of the top 10 article ids you would give to 

# To a new user, recommending articles with top interactions (as the most popular ones on the platform)
top_articles = (df
                .groupby(['article_id', 'title'])
                .count()['user_id']
                .reset_index()
                .sort_values(['user_id'], ascending=False)
                .head(10)
               )
new_user_recs = set(top_articles['article_id'])# Your recommendations here
In [44]:
new_user_recs
Out[44]:
{'1162.0',
 '1170.0',
 '1293.0',
 '1304.0',
 '1314.0',
 '1330.0',
 '1364.0',
 '1427.0',
 '1429.0',
 '1431.0'}
In [45]:
set(['1314.0','1429.0','1293.0','1427.0','1162.0','1364.0','1304.0','1170.0','1431.0','1330.0'])
Out[45]:
{'1162.0',
 '1170.0',
 '1293.0',
 '1304.0',
 '1314.0',
 '1330.0',
 '1364.0',
 '1427.0',
 '1429.0',
 '1431.0'}
In [47]:
assert set(new_user_recs) == set(['1314.0','1429.0','1293.0','1427.0','1162.0','1364.0','1304.0','1170.0','1431.0','1330.0']), "Oops!  It makes sense that in this case we would want to recommend the most popular articles, because we don't know anything about these users."

print("That's right!  Nice job!")
That's right!  Nice job!

Part IV: Content Based Recommendations (EXTRA - NOT REQUIRED)

Another method we might use to make recommendations is to perform a ranking of the highest ranked articles associated with some term. You might consider content to be the doc_body, doc_description, or doc_full_name. There isn't one way to create a content based recommendation, especially considering that each of these columns hold content related information.

1. Use the function body below to create a content based recommender. Since there isn't one right answer for this recommendation tactic, no test functions are provided. Feel free to change the function inputs if you decide you want to try a method that requires more input values. The input values are currently set with one idea in mind that you may use to make content based recommendations. One additional idea is that you might want to choose the most popular recommendations that meet your 'content criteria', but again, there is a lot of flexibility in how you might make these recommendations.

This part is NOT REQUIRED to pass this project. However, you may choose to take this on as an extra way to show off your skills.

In [48]:
import re

import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.stem import WordNetLemmatizer, PorterStemmer
nltk.download(['punkt', 'wordnet', 'stopwords'])

from sklearn.pipeline import Pipeline, FeatureUnion
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.base import BaseEstimator, TransformerMixin
[nltk_data] Downloading package punkt to /Users/kevin/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package wordnet to /Users/kevin/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package stopwords to /Users/kevin/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
In [49]:
df.head()
Out[49]:
article_id title user_id time
0 1430.0 using pixiedust for fast, flexible, and easier... 1 1
1 1314.0 healthcare python streaming application demo 2 1
2 1429.0 use deep learning for image classification 3 1
3 1338.0 ml optimization using cognitive assistant 4 1
4 1276.0 deploy your python model as a restful api 5 1
In [50]:
df_content.head()
Out[50]:
doc_body doc_description doc_full_name doc_status article_id
0 Skip navigation Sign in SearchLoading...\r\n\r... Detect bad readings in real time using Python ... Detect Malfunctioning IoT Sensors with Streami... Live 0
1 No Free Hunch Navigation * kaggle.com\r\n\r\n ... See the forest, see the trees. Here lies the c... Communicating data science: A guide to present... Live 1
2 ☰ * Login\r\n * Sign Up\r\n\r\n * Learning Pat... Here’s this week’s news in Data Science and Bi... This Week in Data Science (April 18, 2017) Live 2
3 DATALAYER: HIGH THROUGHPUT, LOW LATENCY AT SCA... Learn how distributed DBs solve the problem of... DataLayer Conference: Boost the performance of... Live 3
4 Skip navigation Sign in SearchLoading...\r\n\r... This video demonstrates the power of IBM DataS... Analyze NY Restaurant data using Spark in DSX Live 4
In [51]:
def tokenize(text):
    
    # 1. remove url and replace url string as 'urlplaceholder'
    url_regex = 'http[s]?://(?:[a-zA-Z]|[0-9]|[$-_@.&+]|[!*\(\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+'
    
    detected_urls = re.findall(url_regex, text)
    for url in detected_urls:
        text = text.replace(url, "urlplaceholder")

    # 2. remove punctuation
    text = re.sub(r"[^a-zA-Z0-9]"," ",text)
        
    # 3. work tokenization
    tokens = word_tokenize(text)
    
    # 4. remove stop words
    tokens = [tok for tok in tokens if tok not in stopwords.words("english")]
    
    lemmatizer = WordNetLemmatizer() # lemmatization method

    clean_tokens = []
    for tok in tokens:
        
        # 3. converting lowercase and removing space in tokens
        clean_tok = lemmatizer.lemmatize(tok).lower().strip()
        clean_tokens.append(clean_tok)

    return clean_tokens

def insert_title(title, doc_full_name):
    
    if (title is np.nan):
        title = doc_full_name
    
    return title

def generate_tfidf_mat(df=df, df_content=df_content):
    '''
    Generate matrix (in a pandas dataframe format) of articles by tokens (i.e., title)
    '''
    
    pipeline = Pipeline([
        ('vect', CountVectorizer(tokenizer=tokenize)),
        ('tfidf', TfidfTransformer()),
    ])
    
    # clean df_content
    df_content = df_content.drop_duplicates(subset=['article_id'], keep='first')
    df_content['article_id'] = df_content['article_id'].apply(lambda x: str(x)+'.0')
    df_content.index.name = None
    
    
    df = df.drop_duplicates(subset=['article_id'], keep='first')

    df_cleansed = pd.merge(df, df_content, on='article_id', how='outer')
    df_cleansed['title'] = df_cleansed.apply(lambda row: insert_title(row['title'], row['doc_full_name']), axis=1)

    df_cleansed.set_index(df_cleansed['article_id'], inplace=True)
    df_cleansed.index.name = None
    
    df_tfidf_mat = pipeline.fit_transform(df_cleansed['title'])
    
    article_token_df = pd.DataFrame.sparse.from_spmatrix(df_tfidf_mat) # row: article id; col: token id
    article_token_df.set_index(df_cleansed['article_id'], inplace=True)
    article_token_df.index.name = None
    
    return df_cleansed, article_token_df
    
df_cleansed, article_token_df = generate_tfidf_mat()
In [52]:
df_cleansed.head()
Out[52]:
article_id title user_id time doc_body doc_description doc_full_name doc_status
1430.0 1430.0 using pixiedust for fast, flexible, and easier... 1.0 1.0 NaN NaN NaN NaN
1314.0 1314.0 healthcare python streaming application demo 2.0 1.0 NaN NaN NaN NaN
1429.0 1429.0 use deep learning for image classification 3.0 1.0 NaN NaN NaN NaN
1338.0 1338.0 ml optimization using cognitive assistant 4.0 1.0 NaN NaN NaN NaN
1276.0 1276.0 deploy your python model as a restful api 5.0 1.0 NaN NaN NaN NaN
In [53]:
article_token_df
Out[53]:
0 1 2 3 4 5 6 7 8 9 ... 1933 1934 1935 1936 1937 1938 1939 1940 1941 1942
1430.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
1314.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
1429.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
1338.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
1276.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
1040.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
1041.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
1045.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
1046.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
1049.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0

1328 rows × 1943 columns

In [54]:
def find_similar_articles(article_id, article_token_df=article_token_df):
    '''
    INPUT:
    article_id - (int) a article_id
    article_token_df - (pandas dataframe) matrix of articles by tokens (i.e., title)
    
    OUTPUT:
    most_similar_articles - (list) an ordered list where the closest articles (largest dot product articles)
                    are listed first
    
    Description:
    Computes the similarity of every pair of articles based on the dot product
    Returns an ordered list of articles with a higher similarity value
    
    '''
    
    # compute similarity of each article to the provided article
    article_vector = article_token_df[article_token_df.index == article_id]
    sim_vector = article_vector.dot(article_token_df.T).T
    sim_vector.columns = ['product_dot_value']

    
    # create list of just the ids
    sim_vector['article_id'] = sim_vector.index
    
    # remove the own user's id
    most_similar_articles = sim_vector[sim_vector['article_id'] != article_id]
    
    
    # sort by similarity
    most_similar_articles = most_similar_articles \
        .sort_values(['product_dot_value', 'article_id'], ascending=[False, True])
    
    return most_similar_articles # return a list of the most_similar_articles in order from most to least similar
        
In [55]:
df_content.head()
Out[55]:
doc_body doc_description doc_full_name doc_status article_id
0 Skip navigation Sign in SearchLoading...\r\n\r... Detect bad readings in real time using Python ... Detect Malfunctioning IoT Sensors with Streami... Live 0
1 No Free Hunch Navigation * kaggle.com\r\n\r\n ... See the forest, see the trees. Here lies the c... Communicating data science: A guide to present... Live 1
2 ☰ * Login\r\n * Sign Up\r\n\r\n * Learning Pat... Here’s this week’s news in Data Science and Bi... This Week in Data Science (April 18, 2017) Live 2
3 DATALAYER: HIGH THROUGHPUT, LOW LATENCY AT SCA... Learn how distributed DBs solve the problem of... DataLayer Conference: Boost the performance of... Live 3
4 Skip navigation Sign in SearchLoading...\r\n\r... This video demonstrates the power of IBM DataS... Analyze NY Restaurant data using Spark in DSX Live 4
In [56]:
def make_content_recs(user_id, n=10, df=df, df_content=df_content, df_cleansed=df_cleansed):
    '''
    INPUT:
    user_id - (int) a user_id
    m - (int) the number of recommendations you want for the user
    df - pandas dataframe with article_id, title, user_id columns
    df_content - pandas dataframe with doc_body, doc_description, doc_full_name, doc_status, article_id
    df_cleansed - pandas dataframe with article_id, title, user_id, time, doc_body, doc_description, doc_full_name, doc_status;
                  merged from df and df_content
    
    OUTPUT:
    recs_id - (list) a list of recommendations for the user by article id
    recs_name - list) a list of recommendations for the user by article title
    
    Description:
    Makes content based recommendations for a specific input user_id
    For each user - finds articles the user hasn't seen before and provides them as recs
    Does this until n recommendations are found
    
    '''
    
    # 1. Get article ids that user has already seen; the list ordered by interaction rank
    article_seen_ids = list(df[df['user_id'] == user_id]['article_id'].values)
    
    # Get the user-article interaction rank
    interact_ranked_df = (df
                          .groupby(['article_id', 'title'])
                          .count()['user_id']
                          .reset_index()
                          .sort_values(['user_id'], ascending=False))
    interact_ranked_df.rename(columns={'user_id': 'interact_num'}, inplace=True)
    
    article_seen_ids = interact_ranked_df[interact_ranked_df['article_id'].isin(article_seen_ids)]['article_id']
#     article_seen_ids = ['0.0', '2.0']

    # Get a list of articles are in 'Live' status
    articles_in_live = [str(art_id)+'.0' for art_id in df_content['article_id'].unique()]
    
    # 2. Make recommendation based on articles already seen 
    recs_id = []
    recs_name = []
    candidate_articles = []
    
    # Loops => get recommendation based on seen articles with a similar content
    for art_id in article_seen_ids:
        
        # Get articles similar to the target article ranked by product dot value (i.e., similarity value)
        most_similar_articles_all = find_similar_articles(art_id)
        
        # Filter top n articles still in "Live" status
        most_similar_articles = most_similar_articles_all[~most_similar_articles_all['article_id']
                                                              .isin(articles_in_live)].head(n).reset_index()
        
        # Get top n articles names
        most_similar_articles['article_name'] = list(set(df_cleansed[df_cleansed['article_id']
                                                                    .isin(most_similar_articles['article_id'])]['title']))

        # The most similar article directly added to the recommendations list
        recs_id.append(most_similar_articles.loc[0, 'article_id'])
        recs_name.append(most_similar_articles.loc[0, 'article_name'])
        
        if (len(recs_id) == n):
            return (recs_id, recs_name)
        
        # The rest similar articles added to the candicates list
        for idx, row in most_similar_articles.loc[1:, ['article_id', 'article_name']].iterrows():
            candidate_articles.append(list(row.values))
    
    # 3. For the rest, randomly select similar articles from the candidates
    candidate_articles_df = pd.DataFrame(data = candidate_articles, columns = ['article_id', 'article_name'])
    candidate_articles_df.drop_duplicates(subset=['article_id'], keep='first')
    
    recs_size = len(recs_id)
    sample_size = recs_size if recs_size < (n-recs_size) else (n-recs_size)
        
    # Randomly select a specified number of rows    
    recs_candicate_articles = candidate_articles_df.sample(n=sample_size) 
    
    # Add articles to the recommendations list
    for idx, row in recs_candicate_articles.iterrows():
        recs_id.append(row['article_id'])
        recs_name.append(row['article_name'])
    
    return (recs_id, recs_name)
In [57]:
make_content_recs(user_id=131, n=10)
Out[57]:
(['1420.0',
  '1181.0',
  '1175.0',
  '1351.0',
  '1274.0',
  '1428.0',
  '1208.0',
  '1305.0',
  '1163.0',
  '1415.0'],
 ['use spark for python to load data and run sql queries',
  'insights from new york car accident reports',
  'learn basics about notebooks and apache spark',
  'overlapping co-cluster recommendation algorithm (ocular)',
  'process events from the watson iot platform in a streams python application',
  'programmatic evaluation using watson conversation',
  'electric power consumption (kwh per capita) by country',
  'a tensorflow regression model to predict house values',
  'access mysql with python',
  'health insurance (2015): united states demographic measures'])

2. Now that you have put together your content-based recommendation system, use the cell below to write a summary explaining how your content based recommender works. Do you see any possible improvements that could be made to your function? Is there anything novel about your content based recommender?

This part is NOT REQUIRED to pass this project. However, you may choose to take this on as an extra way to show off your skills.

Write an explanation of your content based recommendation system here.

The process of my content based recommender executes:

  1. Creating sparse matrix of features (i.e., articles by content tokens) by implementing countvectorizer and tfidf techniques.
  2. Geting articles that the user has already seen, ordered by interaction rank.
  3. Looping these articles and considering each as a group.
  4. For the each group, finding and ranking articles with a similar content, based on dot product value (i.e., content similarity score) between articles' title.
  5. Filtering similar articles that are still in "Live" status in the article community and articles that user has not seen yet.
  6. Finding the article with the highest similarity score within a group and adding it into the recommendation list.
  7. For the rest in each group, adding them to a candidates list.
  8. Finally, randomly selecting articles from the candidates (i.e, similar article candidates for all groups) until n recommendations are found.

3. Use your content-recommendation system to make recommendations for the below scenarios based on the comments. Again no tests are provided here, because there isn't one right answer that could be used to find these content based recommendations.

This part is NOT REQUIRED to pass this project. However, you may choose to take this on as an extra way to show off your skills.

In [58]:
# make recommendations for a brand new user
print("recommendations for a brand new user:\n")
print(list(get_top_article_ids(10, df=df)))
print(list(get_top_articles(10, df=df)))

print ('############################################################################################################ \n')

# make a recommendations for a user who only has interacted with article id '1427.0'
print("Recommendation for a user who only has interacted with article id '1427.0':\n")
similar_articles_ids = find_similar_articles(article_id='1427.0').head(10)['article_id']
print (list(similar_articles_ids))
print (list(set(df_cleansed[df_cleansed['article_id'].isin(similar_articles_ids)]['title'])))
recommendations for a brand new user:

['1429.0', '1330.0', '1431.0', '1427.0', '1364.0', '1314.0', '1293.0', '1170.0', '1162.0', '1304.0']
['use deep learning for image classification', 'insights from new york car accident reports', 'visualize car data with brunel', 'use xgboost, scikit-learn & ibm watson machine learning apis', 'predicting churn with the spss random tree algorithm', 'healthcare python streaming application demo', 'finding optimal locations of new store using decision optimization', 'apache spark lab, part 1: basic concepts', 'analyze energy consumption in buildings', 'gosales transactions for logistic regression model']
############################################################################################################ 

Recommendation for a user who only has interacted with article id '1427.0':

['124.0', '809.0', '313.0', '1175.0', '893.0', '161.0', '437.0', '122.0', '1298.0', '80.0']
['python machine learning: scikit-learn tutorial', 'ibm watson machine learning: get started', 'breast cancer detection with xgboost, wml and scikit', 'Use the Machine Learning Library in Spark', 'use the machine learning library', 'Use the Machine Learning Library in IBM Analytics for Apache Spark', 'leverage scikit-learn models with core ml', 'from scikit-learn model to cloud with wml client', 'what is machine learning?', 'watson machine learning for developers']

Part V: Matrix Factorization

In this part of the notebook, you will build use matrix factorization to make article recommendations to the users on the IBM Watson Studio platform.

1. You should have already created a user_item matrix above in question 1 of Part III above. This first question here will just require that you run the cells to get things set up for the rest of Part V of the notebook.

In [59]:
# Load the matrix here
user_item_matrix = pd.read_pickle('../test/user_item_matrix.p')
In [60]:
# quick look at the matrix
user_item_matrix.head()
Out[60]:
article_id 0.0 100.0 1000.0 1004.0 1006.0 1008.0 101.0 1014.0 1015.0 1016.0 ... 977.0 98.0 981.0 984.0 985.0 986.0 990.0 993.0 996.0 997.0
user_id
1 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
2 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
3 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
4 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
5 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0

5 rows × 714 columns

2. In this situation, you can use Singular Value Decomposition from numpy on the user-item matrix. Use the cell to perform SVD, and explain why this is different than in the lesson.

In [61]:
# Perform SVD on the User-Item Matrix Here

u, s, vt = np.linalg.svd(user_item_matrix.values, full_matrices=True) # use the built in to get the three matrices
In [62]:
print (u.shape, vt.shape, s.shape)
(5149, 5149) (714, 714) (714,)

Provide your response here.

The data provided in the lessson having data points in null (i.e., missing values), which can not be factorised by a pure SVD method. FunkSVD is an alternative technique that works works well with matric having missing values. However, the user-item matrix generated in this exercies has been initially cleansed as 1's when a user has interacted with an article, 0 otherwise. That is the main reason why here, we can perfrom SVD instead.

3. Now for the tricky part, how do we choose the number of latent features to use? Running the below cell, you can see that as the number of latent features increases, we obtain a lower error rate on making predictions for the 1 and 0 values in the user-item matrix. Run the cell below to get an idea of how the accuracy improves as we increase the number of latent features.

In [63]:
num_latent_feats = np.arange(10,700+10,20)
sum_errs = []

for k in num_latent_feats:
    # restructure with k latent features
    s_new, u_new, vt_new = np.diag(s[:k]), u[:, :k], vt[:k, :]
    
    # take dot product
    user_item_est = np.around(np.dot(np.dot(u_new, s_new), vt_new))
    
    # compute error for each prediction to actual value
    diffs = np.subtract(user_item_matrix, user_item_est)
    
    # total errors and keep track of them
    err = np.sum(np.sum(np.abs(diffs)))
    sum_errs.append(err)
    
    
plt.plot(num_latent_feats, 1 - np.array(sum_errs)/df.shape[0]);
plt.xlabel('Number of Latent Features');
plt.ylabel('Accuracy');
plt.title('Accuracy vs. Number of Latent Features');

4. From the above, we can't really be sure how many features to use, because simply having a better way to predict the 1's and 0's of the matrix doesn't exactly give us an indication of if we are able to make good recommendations. Instead, we might split our dataset into a training and test set of data, as shown in the cell below.

Use the code from question 3 to understand the impact on accuracy of the training and test sets of data with different numbers of latent features. Using the split below:

  • How many users can we make predictions for in the test set?
  • How many users are we not able to make predictions for because of the cold start problem?
  • How many articles can we make predictions for in the test set?
  • How many articles are we not able to make predictions for because of the cold start problem?
In [64]:
df_train = df.head(40000)
df_test = df.tail(5993)

def create_test_and_train_user_item(df_train, df_test):
    '''
    INPUT:
    df_train - training dataframe
    df_test - test dataframe
    
    OUTPUT:
    user_item_train - a user-item matrix of the training dataframe 
                      (unique users for each row and unique articles for each column)
    user_item_test - a user-item matrix of the testing dataframe 
                    (unique users for each row and unique articles for each column)
    test_idx - all of the test user ids
    test_arts - all of the test article ids
    
    '''
    # Your code here
    user_item_train = create_user_item_matrix(df_train)
    user_item_test = create_user_item_matrix(df_test)

    test_idx = user_item_test.index
    test_arts = user_item_test.columns
    
    return user_item_train, user_item_test, test_idx, test_arts

user_item_train, user_item_test, test_idx, test_arts = create_test_and_train_user_item(df_train, df_test)
<ipython-input-30-dca1180eaf5b>:16: SettingWithCopyWarning:


A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy

In [65]:
# 4a
user_item_test.index.isin(user_item_train.index).sum()
Out[65]:
20
In [66]:
# 4b -> cold start problem means "new" users to a platform dont have any "records" (e.g., ratings, interactions...)
len(test_idx) - 20
Out[66]:
662
In [67]:
# 4c
user_item_test.columns.isin(user_item_train.columns).sum()
Out[67]:
574
In [68]:
# 4d
len(test_arts) - 574
Out[68]:
0
In [69]:
# Replace the values in the dictionary below
a = 662 
b = 574 
c = 20 
d = 0 


sol_4_dict = {
    'How many users can we make predictions for in the test set?': c,
    'How many users in the test set are we not able to make predictions for because of the cold start problem?': a,
    'How many articles can we make predictions for in the test set?': b,
    'How many articles in the test set are we not able to make predictions for because of the cold start problem?': d
}

t.sol_4_test(sol_4_dict)
Awesome job!  That's right!  All of the test articles are in the training data, but there are only 20 test users that were also in the training set.  All of the other users that are in the test set we have no data on.  Therefore, we cannot make predictions for these users using SVD.

5. Now use the user_item_train dataset from above to find U, S, and V transpose using SVD. Then find the subset of rows in the user_item_test dataset that you can predict using this matrix decomposition with different numbers of latent features to see how many features makes sense to keep based on the accuracy on the test data. This will require combining what was done in questions 2 - 4.

Use the cells below to explore how well SVD works towards making predictions for recommendations on the test data.

In [70]:
# fit SVD on the user_item_train matrix
u_train, s_train, vt_train = np.linalg.svd(user_item_train, full_matrices=True) # fit svd similar to above then use the cells below
In [71]:
s_train.shape
Out[71]:
(714,)
In [72]:
# Use these cells to see how well you can use the training 
# decomposition to predict on test data
In [73]:
from sklearn.metrics import accuracy_score

def make_prediction(u, s, vt):
    '''
    Calculates SVD values with u, sigma and v (transpose) matices
    '''
    pred = np.round(np.dot(np.dot(u, s), vt))
    
    return pred

def predict_interaction(n, u_train, s_train, vt_train, user_rows, article_rows): 
    '''
    Makes predictions from training set to test set with SVD
    '''
    
    # restructure with k latent features
    u_train_lat, s_train_lat, vt_train_lat = u_train[:, :n], np.diag(s_train[:n]), vt_train[:n, :]
    u_test_lat, vt_test_lat = user_rows[:, :n], article_rows[:n, :]
    
    train_preds = make_prediction(u_train_lat, s_train_lat, vt_train_lat)
    test_preds  = make_prediction(u_test_lat, s_train_lat, vt_test_lat)
    
    return train_preds, test_preds

def generate_matrix(user_item_train, user_item_test, u_train, s_train, vt_train):
    '''
    Generates the training set errors and test set errors matrix
    '''
    
    user_idx = user_item_train.index.isin(user_item_test.index)
    article_idx = user_item_train.columns.isin(user_item_test.columns)
    
    user_item_test2 = user_item_test.loc[user_item_train.index[user_idx==True], user_item_train.columns[article_idx==True]]
    
    user_rows = u_train[user_idx, :]
    article_rows = vt_train[:, article_idx]
    
#     print (user_rows.shape, article_rows.shape)
    
    train_errs = []
    test_errs  = []
    
    for n_features in np.arange(0, 720, 20):

        train_preds, test_preds = predict_interaction(n_features, u_train, s_train, vt_train, user_rows, article_rows)
    
        # compute prediction accuracy
        train_errs.append(accuracy_score(user_item_train.values.flatten(), train_preds.flatten()))
        test_errs.append(accuracy_score(user_item_test2.values.flatten(), test_preds.flatten()))
        
    return train_errs, test_errs

train_errs, test_errs = generate_matrix(user_item_train, user_item_test, u_train, s_train, vt_train)
In [74]:
def plot_train_test_error(train_errs, test_errs):
    '''
    Plots the trend of accuracy on training and test set with different number of latern features
    '''
    plt.figure()
    plt.plot(np.arange(0, 720, 20), train_errs, label='Train')
    plt.plot(np.arange(0, 720, 20), test_errs, label='Test')
    plt.xlabel('Number of Latent Features')
    plt.ylabel('Accuracy')
    plt.title('Accuracy Testing on Train and Test Set with different Number of Latent Features')
    plt.legend()
    plt.show()

plot_train_test_error(train_errs, test_errs)    
In [75]:
# Test with selecting training/testing set randomly

df_train_2 = df.sample(frac = 0.75, random_state=200)
df_test_2 = df.drop(df_train_2.index)

user_item_train_2, user_item_test_2, test_idx_2, test_arts_2 = create_test_and_train_user_item(df_train_2, df_test_2)

u_train_2, s_train_2, vt_train_2 = np.linalg.svd(user_item_train_2, full_matrices=True) 

train_errs_2, test_errs_2 = generate_matrix(user_item_train_2, user_item_test_2, u_train_2, s_train_2, vt_train_2)

plot_train_test_error(train_errs=train_errs_2, test_errs=test_errs_2)  
In [76]:
def explained_variance(sigma, n_components):
    """
    Computes explained variance number of components
    """
    # explained variance
    total_var = np.sum(sigma**2)
    var_exp = np.sum([np.square(i) for i in sigma[:n_components]])
    perc_exp = (var_exp / total_var) * 100
    return round(perc_exp, 4)
In [77]:
explained_variance(sigma=s_train, n_components=300)
Out[77]:
92.0776
In [78]:
explained_variance(sigma=s_train_2, n_components=300)
Out[78]:
91.4342

6. Use the cell below to comment on the results you found in the previous question. Given the circumstances of your results, discuss what you might do to determine if the recommendations you make with any of the above recommendation systems are an improvement to how users currently find articles?

Your response here.

From the cells above, two types of approach in training/test datasets splitting have been used. The first one is having 40000 observations from the top as the traininng set, and the rest of them are the test set. In comparison, the second one is having sample function to randomly subset rows into training or test sets. The other operations in the both approaches are the same. Finally, we have created two plots to compare the training and test errors.

From the plots we can find that both figure has a similar circumstance as the the number latent feature rises up, the accuracy of traning set will also increase gradually and reach the acrrucay score at 1.000. However, the performance on test set will decrease till about 0.965 and 0.985 respectively.

One of the reason for this kind of circumstance is due to the imbalance of common users and articles distribution in training and test datasets. There are just a few users (i.e., 20) are shared among two datasets. As the number of latent features increases, the model becomes more and more overfitted on the traing set and makes predictions on the test set more negatively with lower accuracy.

The are several approaches can be used to make improvements on the recommendation engine such as

  1. scaling up the datasets with more user-article interaction pair records
  2. changing the way of article rankings or making an ensemble of recommendation engine methods
  3. asking feedbacks from online users (on testing environment, not on production environment) for further evalution of how closer the recommendation system can get the "taste" of readers.

A/B Testing Extension

Note: We will first split users into two groups by using cookie-based diversion to track users' movements:

  1. group one (control group) gets the recommending artilces from the original recommendation engine
  2. group two (experiment group) gets the recommending articles from the new recommendation engine based on content/user-item interaction/etc.

Experimental Design steps

1. Constructing an expected user funnel flow

  • a. visit homepage
  • b. browse recommending articles on the page
  • c. select articles wanted
    Note: in this section, we have some atypical events not being expected in this experiment, such as a user who find articles from the search bar etc.

2. Deciding a metric

2a. invariant metric
Number of cookies hitting the homepage should not vary between two groups

2b. evaluation metric

  • 2b.1 ratio of #clicks on recommending articles / #cookies
    We expect an higher number of clicks on the recommended articles by users and a higher ratio of clicks per user in experiment group than the control group value
  • 2b.2 ratio of #refreshing the recommending list/ #cookies
    We expect users in the experiment group could has a lower frequency in refreshing their recommending list and get the articles they want.

3. Performing experiment sizing
At this stage, we need to figure out the feasibility of the experiment in terms of the amount of time to run or the size of two groups should be. For instance, we can use a statistical analysis method to evaluate the resutls. If we could have an statistically significant increase in the daily result of the ratio of #clicks on recommending article / #cookies at an overall 5% Type I Error, it would be good enough to deploy the new recommendation engine to users.

4. Checking validity/bias/ethics in experimentation

4a. validity
Since the evaluation metric is aligned with the experiment goal, we do not need to worry about contruct validity; regarding the internal and external validity, we need do trade-off between two types of validity and focus on evaluating the internal over external validity. We could first conduct the research in a controlled environment to establish the existence of a causal relationship and then build a field experiment to analyse if the results hold in the real world.

4b. bias

  • 4b.1 sampling bias
    We are expecting user population in this experiments could be homogeneous
  • 4b.2 novelty bias
    Novelty bias might exist when new users join the platform. They might be curious with any content in the platform and browse through the webpages and articles against any estimated flow. Moreover, since we do not have any user-item interaction records for new users, their preferences in articles might not be perfectly recommended in the first couple of days, in which new users are more likely to spend more time on browsing and refreshing the recommendation list. The results of this experiment could be negatively affected under this kind of phnomenon. Therefore, users' interaction records after 3-5 days will be taken into consideration and evaluated the change of new recommendation engine.

4c. ethics
We do not need to worry about ethics problem as we only collect user-item interactions records and articles data. No any sensitive data related to users will be extracted.

5. analysing data

Extras

Using your workbook, you could now save your recommendations for each user, develop a class to make new predictions and update your results, and make a flask app to deploy your results. These tasks are beyond what is required for this project. However, from what you learned in the lessons, you certainly capable of taking these tasks on to improve upon your work here!

Conclusion

Congratulations! You have reached the end of the Recommendations with IBM project!

Tip: Once you are satisfied with your work here, check over your report to make sure that it is satisfies all the areas of the rubric. You should also probably remove all of the "Tips" like this one so that the presentation is as polished as possible.

Directions to Submit

Before you submit your project, you need to create a .html or .pdf version of this notebook in the workspace here. To do that, run the code cell below. If it worked correctly, you should get a return code of 0, and you should see the generated .html file in the workspace directory (click on the orange Jupyter icon in the upper left).

Alternatively, you can download this report as .html via the File > Download as submenu, and then manually upload it into the workspace directory by clicking on the orange Jupyter icon in the upper left, then using the Upload button.

Once you've done this, you can submit your project by clicking on the "Submit Project" button in the lower right here. This will create and submit a zip file with this .ipynb doc and the .html or .pdf version you created. Congratulations!

In [185]:
from subprocess import call
call(['python', '-m', 'nbconvert', 'Recommendations_with_IBM.ipynb'])
Out[185]:
0
In [ ]: